176 research outputs found
Normalized Web Distance and Word Similarity
There is a great deal of work in cognitive psychology, linguistics, and computer science on using word (or phrase) frequencies in context in text corpora to develop measures of word similarity or word association, going back to at least the 1960s. The goal of this chapter is to introduce the normalized web distance (NWD) method to determine similarity between words and phrases. It is a general way to tap the amorphous low-grade knowledge available for free on the Internet, typed in by local users aiming at personal gratification of diverse objectives, and yet globally achieving what is effectively the largest semantic electronic database in the world. Moreover, this database is available to all by using any search engine that can return aggregate page-count estimates for a large range of search queries. In the paper introducing the NWD it was called the `normalized Google distance (NGD),' but since Google no longer allows automated searches, we opt for the more neutral and descriptive NWD.
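As a concrete illustration, the NWD of two terms can be computed from aggregate page counts alone, using the standard NGD/NWD formula. The counts and index size below are illustrative numbers, not real search-engine figures:

```python
import math

def normalized_web_distance(fx, fy, fxy, n):
    """Normalized Web Distance from aggregate page counts.

    fx, fy -- page counts for terms x and y individually
    fxy    -- page count for pages containing both x and y
    n      -- total number of indexed pages (normalizing constant)
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Illustrative counts for two related terms; a small distance
# indicates that the terms frequently co-occur.
d = normalized_web_distance(46_700_000, 12_200_000, 2_630_000, 8_000_000_000)
print(f"NWD = {d:.2f}")  # ~0.44 -- small distance, closely related terms
```

Distances near 0 indicate strongly associated terms; unrelated terms approach (or exceed) 1.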
Hierarchical structuring of Cultural Heritage objects within large aggregations
Huge amounts of cultural content have been digitised and are available
through digital libraries and aggregators like Europeana.eu. However, it is not
easy for a user to have an overall picture of what is available nor to find
related objects. We propose a method for hierarchically structuring cultural
objects at different similarity levels. We describe a fast, scalable clustering
algorithm with an automated field selection method for finding semantic
clusters. We report a qualitative evaluation on the cluster categories based on
records from the UK and a quantitative one on the results from the complete
Europeana dataset.
Comment: The paper has been published in the proceedings of the TPDL
conference, see http://tpdl2013.info. For the final version see
http://link.springer.com/chapter/10.1007%2F978-3-642-40501-3_2
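The abstract does not specify the clustering algorithm itself. As a generic illustration of "structuring objects at different similarity levels", here is a minimal single-linkage agglomerative sketch; the data, distance function, and thresholds are hypothetical, and this O(n^3) loop is for exposition, not the paper's scalable method:

```python
def single_linkage_clusters(points, threshold, dist):
    """Agglomerative single-linkage clustering: repeatedly merge the two
    closest clusters until the smallest inter-cluster distance exceeds
    the threshold. Returns a list of clusters (lists of points)."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

# Cutting the hierarchy at increasing thresholds yields coarser
# groupings -- the "different similarity levels" of the abstract.
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
fine = single_linkage_clusters(pts, 0.5, lambda a, b: abs(a - b))    # 3 clusters
coarse = single_linkage_clusters(pts, 4.5, lambda a, b: abs(a - b))  # 2 clusters
```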
CFT Duals for Extreme Black Holes
It is argued that the general four-dimensional extremal Kerr-Newman-AdS-dS
black hole is holographically dual to a (chiral half of a) two-dimensional CFT,
generalizing an argument given recently for the special case of extremal Kerr.
Specifically, the asymptotic symmetries of the near-horizon region of the
general extremal black hole are shown to be generated by a Virasoro algebra.
Semiclassical formulae are derived for the central charge and temperature of
the dual CFT as functions of the cosmological constant, Newton's constant and
the black hole charges and spin. We then show, assuming the Cardy formula, that
the microscopic entropy of the dual CFT precisely reproduces the macroscopic
Bekenstein-Hawking area law. This CFT description becomes singular in the
extreme Reissner-Nordstrom limit where the black hole has no spin. At this
point a second dual CFT description is proposed in which the global part of the
U(1) gauge symmetry is promoted to a Virasoro algebra. This second description
is also found to reproduce the area law. Various further generalizations
including higher dimensions are discussed.
Comment: 18 pages; v2 minor change
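For reference, the Cardy matching invoked above takes a standard form; the central charge and temperature quoted here are those of the extremal Kerr special case (Guica, Hartman, Song, and Strominger), while the present paper derives them as functions of the cosmological constant, Newton's constant, and the charges and spin:

\[
S_{\mathrm{CFT}} \;=\; \frac{\pi^{2}}{3}\, c_L\, T_L ,
\]
and for extremal Kerr, with \(c_L = 12J/\hbar\) and Frolov--Thorne temperature \(T_L = 1/2\pi\),
\[
S \;=\; \frac{\pi^{2}}{3}\cdot\frac{12J}{\hbar}\cdot\frac{1}{2\pi}
\;=\; \frac{2\pi J}{\hbar}
\;=\; \frac{A_{\mathrm{hor}}}{4\hbar G}
\;=\; S_{\mathrm{BH}} .
\]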
Effect of heuristics on serendipity in path-based storytelling with linked data
Path-based storytelling with Linked Data on the Web provides users the ability to discover concepts in an entertaining and educational way. Given a query context, many state-of-the-art pathfinding approaches aim at telling a story that coincides with the user's expectations by investigating paths over Linked Data on the Web. By taking serendipity in storytelling into account, we aim to improve and tailor existing approaches to better fit user expectations, so that users can discover interesting knowledge without feeling unsure of, or even lost in, the story facts. To this end, we propose to optimize the estimation of links between, and the selection of, facts in a story by increasing the consistency and relevancy of links between facts through additional domain delineation and refinement steps. In order to address multiple aspects of serendipity, we propose and investigate combinations of weights and heuristics in paths forming the essential building blocks for each story. Our experimental findings with stories based on DBpedia indicate improvements when applying the optimized algorithm.
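The paper's concrete weights and heuristics are not given in the abstract. As a generic sketch of weighted pathfinding over a link graph, here is Dijkstra's algorithm on a toy knowledge graph; the entities, edges, and weights are entirely made up for illustration:

```python
import heapq

def weighted_story_path(graph, start, goal):
    """Dijkstra over a weighted link graph. Lower weight stands for a
    more consistent/relevant link, so the cheapest path is the story."""
    queue = [(0.0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(queue, (cost + w, nxt, path + [nxt]))
    return None

# Hypothetical link graph: edge weights encode link quality heuristics.
graph = {
    "Mozart": [("Vienna", 0.3), ("Salzburg", 0.2)],
    "Salzburg": [("Austria", 0.4)],
    "Vienna": [("Austria", 0.1)],
    "Austria": [],
}
cost, path = weighted_story_path(graph, "Mozart", "Austria")
print(path)  # -> ['Mozart', 'Vienna', 'Austria']
```

Changing the weights (i.e., the heuristics) changes which story path is selected, which is the knob the abstract describes tuning.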
An almost sure limit theorem for super-Brownian motion
We establish an almost sure scaling limit theorem for super-Brownian motion
associated with a semi-linear equation whose two coefficients are positive
constants. In this case, the spectral-theoretic assumptions required in Chen et
al. (2008) are not satisfied. An example is given to show that the main results
also hold for some sub-domains.
Comment: 14 pages
Cumulants and the moment algebra: tools for analysing weak measurements
Recently it has been shown that cumulants significantly simplify the analysis
of multipartite weak measurements. Here we consider the mathematical structure
that underlies this, and find that it can be formulated in terms of what we
call the moment algebra. Apart from resulting in simpler proofs, the
flexibility of this structure allows generalizations of the original results to
a number of weak measurement scenarios, including one where the weakly
interacting pointers reach thermal equilibrium with the probed system.
Comment: Journal reference added, minor correction
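The moment algebra itself is not reproduced in the abstract. As background on the underlying objects, cumulants are obtained from raw moments by a standard recursion, which can be sketched as follows:

```python
from math import comb

def cumulants_from_moments(moments):
    """Convert raw moments [mu_1, mu_2, ...] into cumulants [k_1, k_2, ...]
    via the standard recursion
        k_n = mu_n - sum_{k=1}^{n-1} C(n-1, k-1) * k_k * mu_{n-k}.
    """
    kappa = []
    for n, mu_n in enumerate(moments, start=1):
        k_n = mu_n - sum(
            comb(n - 1, k - 1) * kappa[k - 1] * moments[n - k - 1]
            for k in range(1, n)
        )
        kappa.append(k_n)
    return kappa

# The raw moments of a Poisson(1) variable are the Bell numbers
# 1, 2, 5, 15, ...; all of its cumulants equal the rate, here 1.
print(cumulants_from_moments([1, 2, 5, 15]))  # -> [1, 1, 1, 1]
```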
Efficient LZ78 factorization of grammar compressed text
We present an efficient algorithm for computing the LZ78 factorization of a
text, where the text is represented as a straight line program (SLP), which is
a context-free grammar in Chomsky normal form that generates a single
string. Given an SLP of size representing a text of length , our
algorithm computes the LZ78 factorization of in time
and space, where is the number of resulting LZ78 factors.
We also show how to improve the algorithm so that the term in the
time and space complexities becomes either , where is the length of the
longest LZ78 factor, or where is a quantity
which depends on the amount of redundancy that the SLP captures with respect to
substrings of of a certain length. Since where
is the alphabet size, the latter is asymptotically at least as fast as
a linear time algorithm which runs on the uncompressed string when is
constant, and can be more efficient when the text is compressible, i.e. when
and are small.
Comment: SPIRE 201
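For reference, LZ78 factorization itself can be sketched on the plain, uncompressed string (the paper's contribution is computing it directly on the SLP without decompressing). Each factor extends the longest previously seen factor by one fresh character and is encoded as a pair (dictionary index of that previous factor, fresh character):

```python
def lz78_factorize(s):
    """LZ78 factorization of a plain string: greedily match the longest
    phrase already in the dictionary, then extend it by one character."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    factors = []
    i = 0
    while i < len(s):
        prev = ""
        while i < len(s) and prev + s[i] in dictionary:
            prev += s[i]
            i += 1
        c = s[i] if i < len(s) else ""  # empty if input ends mid-phrase
        factors.append((dictionary[prev], c))
        dictionary[prev + c] = len(dictionary)
        i += 1 if c else 0
    return factors

print(lz78_factorize("abab"))  # -> [(0, 'a'), (0, 'b'), (1, 'b')]
```

The number of factors returned corresponds to the quantity the complexity bounds in the abstract are stated in terms of.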
mspecLINE: bridging knowledge of human disease with the proteome
Background: Public proteomics databases such as PeptideAtlas contain peptides and proteins identified in mass spectrometry experiments. However, these databases lack information about human disease for researchers studying disease-related proteins. We have developed mspecLINE, a tool that combines knowledge about human disease in MEDLINE with empirical data about the detectable human proteome in PeptideAtlas. mspecLINE associates diseases with proteins by calculating the semantic distance between annotated terms from a controlled biomedical vocabulary. We used an established semantic distance measure that is based on the co-occurrence of disease and protein terms in the MEDLINE bibliographic database.
Results: The mspecLINE web application allows researchers to explore relationships between human diseases and parts of the proteome that are detectable using a mass spectrometer. Given a disease, the tool will display proteins and peptides from PeptideAtlas that may be associated with the disease. It will also display relevant literature from MEDLINE. Furthermore, mspecLINE allows researchers to select proteotypic peptides for specific protein targets in a mass spectrometry assay.
Conclusions: Although mspecLINE applies an information retrieval technique to the MEDLINE database, it is distinct from previous MEDLINE query tools in that it combines the knowledge expressed in scientific literature with empirical proteomics data. The tool provides valuable information about candidate protein targets to researchers studying human disease and is freely available on a public web server.
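A co-occurrence-based semantic distance of the kind described can be sketched over a toy corpus of term-annotated articles. This uses an NGD-style normalized measure; the exact measure and vocabulary used by mspecLINE may differ, and the corpus below is invented for illustration:

```python
import math

def cooccurrence_distance(term_a, term_b, annotated_articles):
    """Semantic distance between two vocabulary terms based on their
    co-occurrence across a bibliographic corpus. Each article is a set
    of annotated terms. Returns infinity if the terms never co-occur."""
    n = len(annotated_articles)
    fa = sum(1 for terms in annotated_articles if term_a in terms)
    fb = sum(1 for terms in annotated_articles if term_b in terms)
    fab = sum(1 for terms in annotated_articles
              if term_a in terms and term_b in terms)
    if fab == 0:
        return float("inf")
    la, lb, lab, ln = (math.log(x) for x in (fa, fb, fab, n))
    return (max(la, lb) - lab) / (ln - min(la, lb))

# Hypothetical corpus: each set is the controlled-vocabulary terms
# annotating one article.
corpus = [
    {"insulin", "diabetes"}, {"insulin", "diabetes", "obesity"},
    {"insulin"}, {"diabetes"}, {"p53", "cancer"}, {"p53"},
]
print(f"{cooccurrence_distance('insulin', 'diabetes', corpus):.2f}")  # ~0.58
print(cooccurrence_distance("insulin", "cancer", corpus))             # inf
```

Terms that frequently appear in the same articles get a small distance, which is what lets the tool rank proteins by their association with a disease.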